[WIP] Phrases: make any2utf8 optional #1413
Conversation
What is the memory impact of this change? The conversion was there for a reason, IIRC.

@piskvorky yes, we had a discussion here regarding this. This PR will be updated accordingly asap.
gensim/models/phrases.py (Outdated)

@@ -169,7 +169,9 @@ def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000):
    if sentence_no % progress_per == 0:
        logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                    (sentence_no, total_words, len(vocab)))
    sentence = [utils.any2utf8(w) for w in sentence]
    sentence = [w for w in (utils.any2utf8(u'_'.join(sentence)).split('_'))]
A few issues here:

- You are trying to split a bytestring (the result of the `any2utf8` call) by `'_'` - this will not work in Python 3+, because literal strings are unicode by default (see the sketch after this list). You've faced similar problems previously, so I think it would be helpful to understand character encodings at a conceptual level, and the differences between string handling in Python 2 and 3.
- Simply `sentence = utils.any2utf8(u'_'.join(sentence)).split('_')` would be enough - no need for the extra `[w for w in ...]`.
- We're not accounting for the possibility that a word in the sentence contains `'_'` here - it would be wrong to make implicit assumptions like these about user input, unless there was an explicit constraint in the API. Escaping could be an option - although I'm not sure it is feasible, performance-wise.
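A minimal sketch of the first point, under Python 3 string semantics (illustration only, not code from the PR):

```python
# The result of a utf8 conversion is a bytes object; in Python 3, bytes.split()
# refuses a str (unicode) delimiter.
joined = u'new_york_city'.encode('utf8')   # stand-in for utils.any2utf8(u'_'.join(sentence))

try:
    joined.split('_')                      # unicode delimiter on a bytes object
except TypeError as err:
    print(err)                             # e.g. "a bytes-like object is required, not 'str'"

print(joined.split(b'_'))                  # works: [b'new', b'york', b'city']
```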
gensim/models/phrases.py (Outdated)

    # sentence = [utils.any2utf8(w) for w in sentence]
    # Unicode tokens in dictionary (not utf8)
    sentence = [w for w in (utils.any2utf8('_'.join(w for w in sentence)).split('_'))]
This should all be in the optimized path (C / Cython), so there's no point wasting time playing with Python calls. The utf8 conversion overhead will be ~0, once this is optimized properly.

You can check with `cython -a` to see how much slow (Python) code is still left on the critical path. There should be none.

This whole style of optimization is not desirable. Optimize the code properly, by writing in low-level C/Cython, not by shuffling Python calls around, joining characters in Python or whatnot. That's not the way to gain a significant speed boost.
@piskvorky Some notes on the wider context of this work. Unfortunately, re-writing code in C is outside the scope of the June evaluation milestone in Prakhar's GSoC proposal submitted in March. Another part of this proposal is selecting a multi-thread/multi-process architecture for Phrases - he is running experiments for it in the joblib PR. Once the proper benchmarks for the

These minor improvements to Phrases have been a good GSoC learning experience for Prakhar, in preparation for the FastText performance optimisation which is the main focus of his GSoC project.

I agree: even if we follow a different approach in the future, I think the current changes are worthwhile, as they do improve the times for Phrases significantly (of course, we need clearer benchmarks, but from the initial results it certainly seems so).

I disagree. This may be a good exercise for @prakhar2b, in preparation for fastText (like @tmylk says), but these utf8 changes obfuscate the code and are not the type of changes that the

Curious to see the benchmarks :)
gensim/models/phrases.py (Outdated)

    if sentence_no % progress_per == 0:
        logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                    (sentence_no, total_words, len(vocab)))
    sentence = [utils.any2utf8(w) for w in sentence]
    if isinstance(sentence[0], bytes):
What happens if `sentence` is empty?
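A small standalone illustration of this concern (not the PR's code): with an empty sentence, indexing the first token raises before `isinstance()` ever runs.

```python
# An empty sentence has no sentence[0], so the type check itself blows up.
sentence = []
try:
    is_bytes = isinstance(sentence[0], bytes)
except IndexError:
    print("empty sentence: sentence[0] does not exist")
```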
gensim/models/phrases.py (Outdated)

    if sentence_no % progress_per == 0:
        logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                    (sentence_no, total_words, len(vocab)))
    sentence = [utils.any2utf8(w) for w in sentence]
    if isinstance(sentence[0], bytes):
        sentence = [w for w in (utils.any2utf8(b';'.join(sentence)).split(b';'))]
What happens if the sentence tokens contain `;`?
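A small standalone illustration of the concern (not the PR's code): joining on `b';'` and splitting again is not a round-trip when a token itself contains the delimiter.

```python
sentence = [b'rock', b'n;roll']                 # the second token contains ';'
roundtrip = b';'.join(sentence).split(b';')
print(roundtrip)                                # [b'rock', b'n', b'roll'] -- one token became two
assert roundtrip != sentence
```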
gensim/models/phrases.py (Outdated)

@@ -133,19 +133,32 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
    `delimiter` is the glue character used to join collocation tokens, and
    should be a byte string (e.g. b'_').

    `recode_to_utf8` is an optional parameter (default True) for any2utf8 conversion of input sentences
How about:

By default, the input sentences will be internally encoded to UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set `recode_to_utf8=False` to skip this recoding step in case you don't care about memory or if your sentences are already bytestrings. This will result in much faster training (~2x faster).
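A hedged usage sketch of the parameter as proposed in this PR (the `recode_to_utf8` keyword is this PR's proposal, not part of released gensim at the time of this review):

```python
from gensim.models import Phrases

sentences = [[u'new', u'york', u'city'], [u'new', u'york', u'times']]

# default behaviour: tokens are recoded to utf8 bytestrings internally
bigram_default = Phrases(sentences, min_count=1, threshold=1)

# proposed option: skip the recoding step; per the PR this trades the memory /
# valid-utf8 guarantees for roughly 2x faster vocab building
bigram_raw = Phrases(sentences, min_count=1, threshold=1, recode_to_utf8=False)
```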
gensim/models/phrases.py (Outdated)

    """
    if min_count <= 0:
        raise ValueError("min_count should be at least 1")
        min_count = 1
Remove these two checks.
gensim/models/phrases.py (Outdated)

    self.min_count = min_count
    self.threshold = threshold
    self.max_vocab_size = max_vocab_size
    self.vocab = defaultdict(int)  # mapping between utf8 token => its count
    self.min_reduce = 1  # ignore any tokens with count smaller than this
    self.delimiter = delimiter
    self.is_bytes = True  # for storing encoding type in vocab for supporting both unicode and bytestring input
Comment hard to understand, and I think it's because the logic is not clear. What is it that is bytes in `is_bytes`? Isn't it better to simply create a flag of whether the input sequences are bytes or not? Then the comment becomes `# do the input sentences consist of bytestrings?`, which is clear.

Also, this check seems to belong to `learn_vocab`, not here.
EDIT: ok, I think `self.is_input_bytes` should be ok.

Adding this comment (and I think it's rightly in init):

self.is_bytes = True  # do the input sentences consist of bytestrings?
# With default (recode_to_utf8=True) we encode input sentences to utf8 bytestrings, but
# with recode_to_utf8=False, we retain encoding, so need to store this encoding
# information to later convert token inputs accordingly (in __getitem__ and export_phrases)
If it's rightly in init, then what happens if `sentences=None` and the user is calling `learn_vocab` manually later?
@piskvorky updated the PR
gensim/models/phrases.py (Outdated)

@@ -133,19 +133,24 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
    `delimiter` is the glue character used to join collocation tokens, and
    should be a byte string (e.g. b'_').

    `recode_to_utf8` - By default, the input sentences will be internally encoded to
    UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set recode_to_utf8=False
Best to render code blocks as literal text (put `recode_to_utf8=False` in backticks).
gensim/models/phrases.py (Outdated)

    `recode_to_utf8` - By default, the input sentences will be internally encoded to
    UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set recode_to_utf8=False
    to skip this recoding step in case you don't care about memory or if your sentences
    are already bytestrings. This will result in much faster training (~2x faster)
Missing full stop at the end of the sentence.
gensim/models/phrases.py (Outdated)

    if self.recode_to_utf8:
        s = [utils.any2utf8(w) for w in sentence]
    else:
        s = [utils.any2utf8(w) for w in sentence] if self.is_input_bytes else list(sentence)
Looks like a bug -- recode is False but recoding happens.
@piskvorky is it true that only unicode strings are accepted (and not bytestrings) as input tokens in `__getitem__` or `export_phrases`, as mentioned here in phrases.py?
@piskvorky this conversion was apparently meant to handle the mismatch between bytestring training input and unicode token input. Is this an unprecedented use case, or should we just raise a warning (or a TypeError for differently encoded training and token input)?
Sorry, I don't understand. If the user said "I don't want recoding", we shouldn't be recoding.
What is the motivation for this?
Yes, we are not doing the recoding while training Phrases.

For example, with `recode_to_utf8=False` and unicode input sentences (for training), we have unicode words in the vocab. Now if we provide bytestring tokens in `__getitem__` or `export_phrases`, this mismatch will be a problem here in phrases.py (or vice-versa).

For no recoding at all, I think the user will have to use the same encoding for both training and phrase retrieval (`__getitem__` / `export_phrases`).
@jayantj @gojomo @menshikh-iv ref - above comment
cc @piskvorky
I think we shouldn't be recoding implicitly in `__getitem__`. The only gain is in the case when `learn_vocab` receives bytestrings, `recode_to_utf8` is `False` and `__getitem__` receives unicode. I don't think that justifies the dangerous subtle errors that it could cause.

Also, if we're going to be implicitly handling different input formats for `learn_vocab` vs `__getitem__`, we're also going to have to take care of the delimiter here.
I think the delimiter issue is sorted out in `learn_vocab` here. Just for information, what could those dangerous errors be? The idea is simply to convert the incoming token in `__getitem__` to match the encoding in the vocab.
As @piskvorky mentioned, suppose we had `recode_to_utf8=False`, `learn_vocab` was called with latin2 bytestrings, and latin2 bytestrings are then sent to `__getitem__`. We'd silently recode the latin2 bytestrings to utf8 in `__getitem__` and the lookup would fail, even though it should succeed.
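A self-contained sketch of that failure mode (plain Python, not gensim code), assuming a vocab built from latin-2 bytestrings with `recode_to_utf8=False`:

```python
word = u"žluť"                              # Czech word with non-ASCII characters
latin2_key = word.encode("iso-8859-2")      # the key as learn_vocab would store it
vocab = {latin2_key: 42}

# A silent "recode to utf8" step on lookup produces a different byte sequence
# (the decode uses latin-1 here purely so the demo doesn't raise a decode error).
recoded_key = latin2_key.decode("latin-1").encode("utf-8")

assert latin2_key in vocab                  # the untouched key would have matched
assert recoded_key not in vocab             # the recoded key silently misses
```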
OK, yes, I understand. I think failing explicitly should be the correct choice then (for a mismatch between the training input and the infer input with `recode_to_utf8=False`). What kind of error should we raise for this mismatch?
gensim/models/phrases.py (Outdated)

    """Collect unigram/bigram counts from the `sentences` iterable."""
    if not self.recode_to_utf8 and sentences is not None:
        sentence = list(next(iter(sentences)))
If I understand correctly, this will raise an exception if sentences is either an empty list or an empty generator.
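A quick standalone demonstration of that edge case (assuming the peek uses `next(iter(...))` as in the diff above):

```python
# Both an empty list and an empty generator raise StopIteration on the first peek.
for sentences in ([], (s for s in [])):
    try:
        first = list(next(iter(sentences)))
    except StopIteration:
        print("no first sentence to inspect:", type(sentences).__name__)
```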
In that case, ideally a simple test should be catching this too.
"source": [ | ||
"# currently on develop --- original code\n", | ||
"from gensim.models import Phrases\n", | ||
"bigram = Phrases(Text8Corpus(text8_file))\n", |
Better convert the streamed iterable to an in-memory list (using `list()`), it's small enough. That way we don't have to iterate over the file from disk every time. This will make the benchmark conclusions stronger (less noise and delays from other, unrelated parts of the code, IO overhead etc.).
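A hedged sketch of that suggestion (`text8_file` is whatever path the benchmark notebook already defines):

```python
from gensim.models import Phrases
from gensim.models.word2vec import Text8Corpus

sentences = list(Text8Corpus(text8_file))   # stream from disk exactly once
bigram = Phrases(sentences)                 # time this call on the in-memory list
```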
gensim/models/phrases.py (Outdated)

    try:
        sentence = list(next(iter(sentences)))
    except:
        raise ValueError("Input can not be empty list or generator.")
-1: why should empty input suddenly become a special case, raising an exception?
This concern was raised here in the review comment above, and also as discussed on gitter: `sentences=[]` is not None, but it will throw an error at the `next(iter(...))` call in the case above.
Yes, but that's a reason to fix the bug, not change the API :)
Oh, I should have just logged a warning ("Empty sentences provided as input") in the except block, with no need to raise the error.
Unless it's really a special case (and I don't think it is), there's no need to treat it in a special way.
No exception, no warning -- there's nothing special about an empty corpus with regard to Phrases, except your new check for its first element. It's not a special case.
@prakhar2b can you please fix the two issues I pointed out last week?
gensim/models/phrases.py (Outdated)

            self.delimiter = utils.to_unicode(self.delimiter)
            self.is_input_bytes = False
        sentences = it.chain([sentence], sentences)
    except:
Using catch-all except blocks is generally a bad idea, since you could end up catching unexpected exceptions. So catch only the specific exception expected here (`StopIteration`?).
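A hedged sketch of that suggestion (the helper name and structure are illustrative, not the PR's code): peek at the first sentence while catching only the exception an exhausted iterator is expected to raise.

```python
import itertools as it

def peek_first_sentence(sentences):
    """Return (first_sentence, rechained_iterable); first_sentence is None if empty."""
    iterator = iter(sentences)
    try:
        first = next(iterator)
    except StopIteration:               # empty list or empty generator -- not a special case
        return None, sentences
    return first, it.chain([first], iterator)
```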
gensim/models/phrases.py (Outdated)

        self.is_input_bytes = False
        sentences = it.chain([sentence], sentences)
    except:
        # No need to raise any exception or log any warning, as it's not a special case.
A debug message here would serve well too.
I have a question regarding the behaviour of Phrases with

If we do I added this test (for bytestrings infer input) for
Not sure how that comes about, but it looks like a bug to me @prakhar2b (not intended behaviour).

What's the status, @prakhar2b?

@menshikh-iv should I reopen my #1454, since this was closed?

@piskvorky I think not; Filip now works on #1446, as I understand it.

@menshikh-iv #1446 is completely orthogonal to this. You can convert to utf8 with or without a memory-bounded counter.
Updated benchmark (for text8)

Speed improvement in the Phrases module.

Phrases optimization benchmark (for text8)

Note: leaving this here for the context of prior discussions.